Abstract



We describe discrete restricted Boltzmann machines: probabilistic graphical models
with bipartite interactions between discrete visible and hidden variables. These
models generalize standard binary restricted Boltzmann machines and discrete
naive Bayes models. For a given number of visible variables and cardinalities of
their state spaces, we bound the number of hidden variables, depending on the
cardinalities of their state spaces, for which the model is a universal approximator
of probability distributions. More generally, we describe exponential subfamilies
and use them to bound the Kullback-Leibler approximation errors of these models
from above. We use coding theory and algebraic methods to study the geometry of
these models, and show that in many cases they have the dimension expected from
counting parameters, but in some cases they do not. We discuss inference functions,
mixtures of product distributions with shared parameters, and patterns of
strong modes of probability distributions represented by discrete restricted Boltzmann
machines in terms of configurations of projected products of simplices in
normal fans of products of simplices.



1 Introduction 

A restricted Boltzmann machine (RBM) is a probabilistic graphical model with bipartite interactions
between an observed set of units and a hidden set of units (see [32, 10, 13, 14]). The RBM probability
model is the set of joint probability distributions on the states of the observed units for all possible
choices of interaction weights in the network. Typically RBMs are defined with binary units, but
RBMs with other types of variables have also been considered, including continuous, discrete, and
mixed type variables, see for instance [35, 19, 30, 8, 33]. Probability models with more general
interaction networks have also been considered, including semi-restricted Boltzmann machines and
higher-order interaction Boltzmann machines, see for instance [31, 20, 26, 28]. While each unit $X_i$
of a binary RBM has the state space $\{0, 1\}$, the state space of each unit $X_i$ of a discrete RBM is a
finite set $\mathcal{X}_i = \{0, 1, \ldots, r_i - 1\}$. A discrete RBM is a type of exponential family harmonium.

We discuss the representational power of discrete RBMs. We generalize previous theoretical results 
on standard, binary, RBMs and discrete naive Bayes models to discrete RBMs. 

A characterizing property of RBMs is that the observed units are independent given the states of
the hidden units and vice versa. This is a consequence of the bipartiteness of the interaction graph,
and does not depend on the units' state spaces. Discrete RBMs can be trained (in principle) using
existing methods, like contrastive divergence (CD) [12, 13, 4] and expectation-maximization (EM)
methods [9]. Like binary RBMs, they can be used to train the parameters of deep learning systems
layer by layer [15, 3]. Compared with general network-based models with hidden variables, RBMs
are much more tractable, even if finding maximum likelihood estimates of target data distributions
is usually difficult in either case.
Figure 1: Examples of probability models treated in this paper, in the special case of binary visible
variables. The light (dark) nodes represent visible (hidden) variables. The number inside each node 
indicates the cardinality of the state space of the corresponding variable. From left to right: a binary 
RBM; a discrete RBM with an 8-valued and a binary hidden unit; and a binary naive Bayes model 
with 16 hidden classes. The total number of parameters of each model is indicated at the top. 



A discrete RBM is a product of experts [12]. It has one expert, which is a mixture model of product
distributions (a naive Bayes model), for each hidden unit. Discrete RBMs interpolate between
standard binary RBMs and naive Bayes models, which are just discrete RBMs with one single
hidden unit. They can serve, in particular, to contrast distributed (restricted) mixture
representations [2, 23] from binary RBMs and non-distributed (unrestricted) mixture representations from
naive Bayes models. See Figure 1.

Naive Bayes models have been studied across many disciplines. In machine learning they are most
commonly used for classification and clustering, but have also been considered for probabilistic
modelling [18]. It is known that they can represent any probability distribution if the number of
hidden classes is large enough, see [21] for tight bounds. In spite of their seeming simplicity, the
geometry of these models is far from fully understood. Recent theoretical work on binary RBMs
includes universal approximation properties [10, 16, 22], dimension and parameter identifiability [7],
Bayesian learning coefficients [1], complexity [17], approximation errors [25], and distributed
mixture representations [23]. We shall generalize some of these results to discrete RBMs.

Section 2 collects basic facts about independence models and hierarchical models, and briefly
reviews the theory of naive Bayes models and binary RBMs. Section 3 defines discrete RBMs formally
and describes them (i) as products of mixtures of products (Proposition 8), and (ii) as restricted
mixtures of products. Section 4 elaborates on the distributed mixtures of products and the inference
functions represented by discrete RBMs. Proposition 12, Lemma 13, and Proposition 15 address
the inference functions. Section 5 addresses the expressive power of discrete RBMs by describing
tractable explicit submodels (Theorem 16) and contains results on universal approximation and
maximal model approximation errors (Theorem 17). Section 6 discusses the dimension of discrete RBM
models (Proposition 19 and Theorem 21). Section 7 contains an algebraic combinatorial discussion
of tropicalization (Theorem 23), with consequences for the dimension of discrete RBMs collected in
Propositions 26.a, 26.b, and 26.c.



2 Preliminaries 



2.1 Independence models 

Consider a system of $n < \infty$ random variables $X_1, \ldots, X_n$. Assume that $X_i$ takes states $x_i$ in a
finite set $\mathcal{X}_i = \{0, 1, \ldots, r_i - 1\}$ for all $i \in \{1, \ldots, n\} =: [n]$. The state space of the entire system
is $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. We write $x_\lambda = (x_i)_{i \in \lambda}$ for a joint state of the variables with index $i \in \lambda$ for
any $\lambda \subseteq [n]$, and $x = (x_1, \ldots, x_n)$ for a joint state of all variables. We denote by $\Delta(\mathcal{X})$ the set of
all probability distributions on $\mathcal{X}$. We write $\langle a, b \rangle$ for the product $a^\top b$.

The independence model of $X_1, \ldots, X_n$ is the set of product distributions $p(x) = \prod_{i \in [n]} p_i(x_i)$ for
all $x \in \mathcal{X}$, where $p_i$ is a probability distribution on $\mathcal{X}_i$ for all $i \in [n]$. This model is the closure
$\overline{\mathcal{E}_\mathcal{X}}$ in the Euclidean topology of the exponential family

$$\mathcal{E}_\mathcal{X} = \Big\{ \tfrac{1}{Z} \exp(\langle \theta, A^{(\mathcal{X})} \rangle) : \theta \in \mathbb{R}^{d_\mathcal{X}} \Big\} \tag{1}$$

with a matrix $A^{(\mathcal{X})} \in \mathbb{R}^{d_\mathcal{X} \times \mathcal{X}}$ of sufficient statistics $\mathbb{1}$ (constant function on $\mathcal{X}$ with value one) and
$\mathbb{1}_{\{x : x_i = y_i\}}$ for all $y_i \in \mathcal{X}_i \setminus \{0\}$ for all $i \in [n]$ (indicator functions of subsets of $\mathcal{X}$). The convex
support of $\mathcal{E}_\mathcal{X}$ is the convex hull of the columns of $A^{(\mathcal{X})}$, which is a Cartesian product of simplices
$Q_\mathcal{X} := \operatorname{conv}(\{A^{(\mathcal{X})}_x\}_{x \in \mathcal{X}}) \cong \Delta(\mathcal{X}_1) \times \cdots \times \Delta(\mathcal{X}_n)$.

Figure 2: The convex supports of the independence models of three binary variables (left), and of two
variables, one binary and one ternary (right), discussed in Example 1. Both are three-dimensional
polytopes. The prism has fewer vertices than the cube and is in this sense more similar to a
3-simplex.

Example 1. The sufficient statistics of the independence models $\mathcal{E}_\mathcal{X}$ and $\mathcal{E}_{\mathcal{X}'}$ with $\mathcal{X} = \{0, 1\}^3$ and
$\mathcal{X}' = \{0, 1, 2\} \times \{0, 1\}$ are matrices with rows labeled by indicator functions.
In the first case the convex support is a cube; in the second it is a prism. See Figure 2.
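The sufficient statistics matrices of Example 1 can be generated directly from the definition in Section 2.1. The following is a minimal sketch, not the authors' code: the function name `sufficient_statistics` and the lexicographic ordering of the columns are assumptions made for illustration.

```python
import itertools
import numpy as np

def sufficient_statistics(cardinalities):
    """Return (states, A) for the independence model on a product state space.

    Rows of A: the constant function 1, followed by the indicators
    1{x_i = y_i} for every unit i and every y_i in {1, ..., r_i - 1}.
    Columns of A: joint states x in lexicographic order (assumed ordering).
    """
    states = list(itertools.product(*[range(r) for r in cardinalities]))
    rows = [[1] * len(states)]
    for i, r in enumerate(cardinalities):
        for y in range(1, r):
            rows.append([1 if x[i] == y else 0 for x in states])
    return states, np.array(rows)

# Three binary variables: the columns are the vertices of a cube.
states_cube, A_cube = sufficient_statistics([2, 2, 2])
# One ternary and one binary variable: the columns are the vertices of a prism.
states_prism, A_prism = sufficient_statistics([3, 2])

print(A_cube)
print(A_prism)
```

Both matrices have four rows (one constant row plus three indicator rows), so both convex supports are three-dimensional; the cube has 8 vertices and the prism has 6.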



2.2 Naive Bayes models 

Let $k \in \mathbb{N}$. The $k$-mixture of the independence model, or naive Bayes model with $k$ hidden classes,
on the variables $X_1, \ldots, X_n$ is the set of all probability distributions expressible as convex
combinations of $k$ points in $\overline{\mathcal{E}_\mathcal{X}}$:

$$\mathcal{M}_{\mathcal{X},k} := \Big\{ \sum_{i \in [k]} \lambda_i p^{(i)} : p^{(i)} \in \overline{\mathcal{E}_\mathcal{X}},\ \lambda_i \geq 0\ \forall i \in [k],\ \sum_{i \in [k]} \lambda_i = 1 \Big\} . \tag{2}$$

We write $\mathcal{M}_{n,k}$ for the $k$-mixture of the independence model of $n$ binary variables. The dimensions
of the mixtures of binary product distributions are known:

Theorem 2 (Catalisano, Geramita, and Gimigliano [5]). The mixture models of binary product
distributions $\mathcal{M}_{n,k}$ have the dimension expected from counting parameters, $\min\{nk + (k - 1), 2^n - 1\}$,
except when $n = 4$ and $k = 3$, when $\mathcal{M}_{n,k}$ has dimension 13 instead of 14.

Let $\mathfrak{A}(\mathcal{X}, 2)$ denote the maximal cardinality of a subset of $\mathcal{X}$ of minimum Hamming distance at
least two, i.e., the maximal cardinality of a subset $\mathcal{X}' \subseteq \mathcal{X}$ with $d_H(x, y) \geq 2$ for all distinct points
$x, y \in \mathcal{X}'$, where $d_H(x, y) := |\{i \in [n] : x_i \neq y_i\}|$. The function $\mathfrak{A}$ is familiar in coding theory.
The $k$-mixtures of independence models are universal approximators when $k$ is large enough. This
can be made precise in terms of $\mathfrak{A}(\mathcal{X}, 2)$:

Theorem 3 ([21]). The model $\mathcal{M}_{\mathcal{X},k}$ is equal to $\Delta(\mathcal{X})$ if $k \geq |\mathcal{X}| / \max_{i \in [n]} |\mathcal{X}_i|$, and only if $k \geq \mathfrak{A}(\mathcal{X}, 2)$.

When $\mathcal{X} = \{0, 1, \ldots, q - 1\}^n$ and $q$ is a power of a prime number, then $\mathfrak{A}(\mathcal{X}, 2) = q^{n-1}$ (see [11, 34]),
and by Theorem 3 $\mathcal{M}_{\mathcal{X},k} = \Delta(\mathcal{X})$ iff $k \geq q^{n-1}$. In particular, the smallest mixture of products model
universal approximator of distributions on $\{0, 1\}^n$ has $2^{n-1}(n + 1) - 1$ parameters.

A state $x \in \mathcal{X}$ is a mode of $p \in \Delta(\mathcal{X})$ if $p(x) > p(y)$ for all $y$ with $d_H(x, y) = 1$. The point $x$ is a
strong mode of $p$ if $p(x) > \sum_{y : d_H(x, y) = 1} p(y)$.

Lemma 4 ([23]). If a mixture $p = \sum_i \lambda_i p^{(i)}$ of product distributions has strong modes $\mathcal{C} \subseteq \mathcal{X}$, then
there is a mixture component $p^{(i)}$ with mode $x$ for each $x \in \mathcal{C}$. In particular a mixture of $k$ product
distributions has at most $k$ strong modes.

2.3 Binary restricted Boltzmann machines 

The RBM model with $n$ visible and $m$ hidden binary units, denoted $\mathrm{RBM}_{n,m}$, is the set of
distributions on $\{0, 1\}^n$ of the form

$$p(x) = \frac{1}{Z} \sum_{h \in \{0,1\}^m} \exp(h^\top W x + B^\top x + C^\top h) \quad \text{for all } x \in \{0, 1\}^n , \tag{3}$$

where $x$ denotes states of the visible units, $h$ denotes states of the hidden units, $W = (W_{ji})_{ji} \in \mathbb{R}^{m \times n}$
is a matrix of interaction weights, $B \in \mathbb{R}^n$ and $C \in \mathbb{R}^m$ are vectors of bias weights, and
$Z = \sum_{x \in \{0,1\}^n} \sum_{h \in \{0,1\}^m} \exp(h^\top W x + B^\top x + C^\top h)$ is the partition function.
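For small $n$ and $m$, the distribution in eq. (3) can be evaluated by enumerating all hidden and visible states. The sketch below does exactly that; the function name `rbm_marginal` and the random parameter values are illustrative assumptions, and the enumeration is exponential in $n + m$, so it is only meant for toy sizes.

```python
import itertools
import numpy as np

def rbm_marginal(W, B, C):
    """Visible marginal of a binary RBM, eq. (3), by brute-force enumeration.

    W has shape (m, n); B has shape (n,); C has shape (m,).
    Returns a vector indexed by the visible states in lexicographic order.
    """
    m, n = W.shape
    visibles = [np.array(x) for x in itertools.product([0, 1], repeat=n)]
    hiddens = [np.array(h) for h in itertools.product([0, 1], repeat=m)]
    unnormalized = np.array([
        sum(np.exp(h @ W @ x + B @ x + C @ h) for h in hiddens)
        for x in visibles
    ])
    # Dividing by the sum over all visible states is division by Z.
    return unnormalized / unnormalized.sum()

rng = np.random.default_rng(0)
p = rbm_marginal(rng.normal(size=(2, 3)), rng.normal(size=3), rng.normal(size=2))
print(p, p.sum())
```

With all parameters set to zero, every joint state has the same weight and the marginal is uniform on $\{0,1\}^n$.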

It is known that binary RBMs have the expected dimension for many choices of n and m: 

Theorem 5 ([7]). The dimension of $\mathrm{RBM}_{n,m}$ is equal to the number of parameters, $mn + n + m$,
when $m + 1 \leq 2^{n - \lceil \log_2(n+1) \rceil}$, and equal to $2^n - 1$ when $m \geq 2^{n - \lfloor \log_2(n+1) \rfloor}$.

It is known that binary RBMs are universal approximators when they have enough hidden units:
Theorem 6 ([22]). The model $\mathrm{RBM}_{n,m}$ equals $\Delta(\{0,1\}^n)$ whenever $m \geq 2^{n-1} - 1$.

It is not known whether the bound from Theorem 6 is always tight, but it shows that the smallest
RBM universal approximator of distributions on $\{0, 1\}^n$ has at most $2^{n-1}(n + 1) - 1$ parameters,
and hence not more than the smallest mixture of products universal approximator.
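The two parameter counts compared here follow from the formulas stated in Theorem 2 and Theorem 5. A small sketch (function names are illustrative) confirms that both evaluate to $2^{n-1}(n+1) - 1$ at the respective thresholds:

```python
def mixture_params(n, k):
    """Parameters of the k-mixture of products of n binary variables: nk + (k - 1)."""
    return n * k + (k - 1)

def rbm_params(n, m):
    """Parameters of a binary RBM with n visible and m hidden units: mn + n + m."""
    return m * n + n + m

for n in range(2, 8):
    k = 2 ** (n - 1)        # smallest universal mixture of products (Theorem 3)
    m = 2 ** (n - 1) - 1    # sufficient number of hidden units (Theorem 6)
    print(n, mixture_params(n, k), rbm_params(n, m), 2 ** (n - 1) * (n + 1) - 1)
```

Since Theorem 6 is only a sufficient condition, the RBM count is an upper bound on the size of the smallest universal RBM, while the mixture count is exact.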

3 Discrete restricted Boltzmann machines 

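Let $\mathcal{X}_i = \{0, 1, \ldots, r_i - 1\}$ for $i \in [n]$ and $\mathcal{Y}_j = \{0, 1, \ldots, s_j - 1\}$ for $j \in [m]$. The graphical
model with full bipartite interactions $\{\{i, j\} : i \in [n], j \in [m]\}$ on $\mathcal{X} \times \mathcal{Y}$ is the exponential family
$\mathcal{E}_{\mathcal{X},\mathcal{Y}} := \{ \tfrac{1}{Z} \exp(\langle \theta, A^{(\mathcal{X},\mathcal{Y})} \rangle) : \theta \in \mathbb{R}^{d_\mathcal{X} d_\mathcal{Y}} \}$ with sufficient statistics matrix equal to the Kronecker
product $A^{(\mathcal{X},\mathcal{Y})} = A^{(\mathcal{X})} \otimes A^{(\mathcal{Y})}$.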

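Definition 7. The discrete RBM model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is the set of marginal distributions on $\mathcal{X}$ of $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$.

The matrix $A^{(\mathcal{X},\mathcal{Y})}$ has $\big(\sum_{i \in [n]} (|\mathcal{X}_i| - 1) + 1\big)\big(\sum_{j \in [m]} (|\mathcal{Y}_j| - 1) + 1\big)$ linearly independent
rows, and $|\mathcal{X} \times \mathcal{Y}|$ columns, corresponding to the joint states of all variables. The parametrization

$$p_\theta(x, y) = \frac{1}{Z} \exp(\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(x,y)} \rangle) \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y} \tag{4}$$

is one-to-one, disregarding the constant row of $A^{(\mathcal{X},\mathcal{Y})}$, which always cancels out with the
normalization constant. The dimension of $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$ is equal to the number of rows of $A^{(\mathcal{X},\mathcal{Y})}$ minus one. The
dimension of $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ expected from counting parameters is equal to $\min\{\dim(\mathcal{E}_{\mathcal{X},\mathcal{Y}}), |\mathcal{X}| - 1\}$.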

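In the case of one single hidden unit, $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is the naive Bayes model on $\mathcal{X}$ with $|\mathcal{Y}|$ hidden
classes. When $\mathcal{X} = \{0, 1\}^n$ and $\mathcal{Y} = \{0, 1\}^m$, then $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a binary RBM. Note that
$h^\top W x + B^\top x + C^\top h = \langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(x,h)} \rangle$, where $\theta$ is the column by column vectorization of
$\begin{pmatrix} 0 & B^\top \\ C & W \end{pmatrix}$.

Consider a parameter vector $\theta \in \mathbb{R}^{d_\mathcal{X} d_\mathcal{Y}}$ of $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$ and let $\Theta \in \mathbb{R}^{d_\mathcal{Y} \times d_\mathcal{X}}$ be a matrix with column by
column vectorization equal to $\theta$. By Roth's lemma [29] we have the identity
$\theta^\top (A^{(\mathcal{X})} \otimes A^{(\mathcal{Y})})_{(x,y)} = (A^{(\mathcal{X})}_x)^\top \Theta^\top A^{(\mathcal{Y})}_y$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$. This allows us to write

$$\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(x,y)} \rangle = \langle \Theta A^{(\mathcal{X})}_x, A^{(\mathcal{Y})}_y \rangle = \langle \Theta^\top A^{(\mathcal{Y})}_y, A^{(\mathcal{X})}_x \rangle \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y} . \tag{5}$$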
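The inner product in eq. (5) describes the following distributions:

$$p_\theta(\cdot, \cdot) = \frac{1}{Z} \exp(\langle \theta, A^{(\mathcal{X},\mathcal{Y})} \rangle) , \tag{6}$$

$$p_\theta(\cdot \mid x) = \frac{1}{Z} \exp(\langle \Theta A^{(\mathcal{X})}_x, A^{(\mathcal{Y})} \rangle) , \quad \text{and} \tag{7}$$

$$p_\theta(\cdot \mid y) = \frac{1}{Z} \exp(\langle \Theta^\top A^{(\mathcal{Y})}_y, A^{(\mathcal{X})} \rangle) . \tag{8}$$

Geometrically, $\Theta A^{(\mathcal{X})}$ is a linear projection of the columns of $A^{(\mathcal{X})}$ into the parameter space of $\mathcal{E}_\mathcal{Y}$,
and similarly, $\Theta^\top A^{(\mathcal{Y})}$ is a projection into the parameter space of $\mathcal{E}_\mathcal{X}$.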
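The identity behind eq. (5) is an instance of the standard relation between the Kronecker product and column-major vectorization, and it is easy to check numerically. In the sketch below the vectors `a_x` and `a_y` are random stand-ins for columns of the two sufficient statistics matrices (an assumption for illustration; the identity holds for arbitrary vectors).

```python
import numpy as np

rng = np.random.default_rng(1)
d_X, d_Y = 4, 3

Theta = rng.normal(size=(d_Y, d_X))
theta = Theta.flatten(order="F")    # column-by-column vectorization of Theta
a_x = rng.normal(size=d_X)          # stands in for a column of A^(X)
a_y = rng.normal(size=d_Y)          # stands in for a column of A^(Y)

lhs = theta @ np.kron(a_x, a_y)     # <theta, (A^(X) kron A^(Y))_(x,y)>
mid = (Theta @ a_x) @ a_y           # <Theta A_x, A_y>
rhs = (Theta.T @ a_y) @ a_x         # <Theta^T A_y, A_x>

print(lhs, mid, rhs)
```

The three printed numbers agree up to floating-point error, which is the content of eq. (5).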

Polynomial parametrization 

Discrete RBMs can be parametrized not only using the exponential parametrization of hierarchical 
models, but also by simple polynomials. 

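The distributions from $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$ can be parametrized in the following way (by square free monomials):

$$p(v, h) = \frac{1}{Z} \prod_{\substack{\{j,i\} \in [m] \times [n], \\ (y'_j, x'_i) \in \mathcal{Y}_j \times \mathcal{X}_i}} \big(\gamma_{\{j,i\},(y'_j, x'_i)}\big)^{\delta_{y'_j}(h_j)\, \delta_{x'_i}(v_i)} \quad \forall (v, h) \in \mathcal{X} \times \mathcal{Y} , \tag{9}$$

where $\gamma_{\{j,i\},(y'_j, x'_i)} \in \mathbb{R}_+$. The discrete RBM probability distributions can be written as

$$p(v) = \frac{1}{Z} \prod_{j \in [m]} \Big( \sum_{h_j \in \mathcal{Y}_j} \gamma_{\{j,1\},(h_j, v_1)} \cdots \gamma_{\{j,n\},(h_j, v_n)} \Big) \quad \forall v \in \mathcal{X} . \tag{10}$$

Here the parameters $\gamma_{\{j,i\},(y'_j, x'_i)}$ can be understood as $\exp(\theta_{\{j,i\},(y'_j, x'_i)})$, where $\theta$ is a natural
parameter vector of $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$.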

Products of mixtures and mixtures of products 

In the following we describe discrete RBMs from two complementary perspectives: (i) as products 
of experts, where each expert is a mixture of products, and (ii) as restricted mixtures of product 
distributions. 

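The renormalized entry-wise product (Hadamard product) of two probability distributions $p$ and $q$
on $\mathcal{X}$ is defined as $p \circ q = (p(x) q(x))_{x \in \mathcal{X}} / \sum_{y \in \mathcal{X}} p(y) q(y)$.

Proposition 8. The model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a Hadamard product of mixtures of product distributions:

$$\mathrm{RBM}_{\mathcal{X},\mathcal{Y}} = \mathcal{M}_{\mathcal{X},|\mathcal{Y}_1|} \circ \cdots \circ \mathcal{M}_{\mathcal{X},|\mathcal{Y}_m|} .$$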

Proof. Proposition 8 can be seen by considering the parametrization (10). To make this explicit,
one can use a homogeneous version of the matrix $A^{(\mathcal{X},\mathcal{Y})}$, which we denote by $A$ and which defines
the same model. A row of $A$ is indexed by an edge $\{i, j\}$ of the bipartite graph and a joint state
$(x_i, h_j)$ of the visible and hidden unit connected by this edge. Such a row has a one in any column
when these states agree with the global state, and zero otherwise. Let $A_{j,:}$ denote the matrix
containing the rows of $A$ with indices $(\{i, j\}, (x_i, h_j))$ for all $x_i \in \mathcal{X}_i$ for all $i \in [n]$ for all $h_j \in \mathcal{Y}_j$,
and let $A(x, h)$ denote the $(x, h)$-column of $A$. We have

$$\begin{aligned}
p(x) &= \frac{1}{Z} \sum_h \exp(\langle \theta, A(x, h) \rangle) \\
&= \frac{1}{Z} \sum_h \exp(\langle \theta_{1,:}, A_{1,:}(x, h) \rangle) \exp(\langle \theta_{2,:}, A_{2,:}(x, h) \rangle) \cdots \exp(\langle \theta_{m,:}, A_{m,:}(x, h) \rangle) \\
&= \frac{1}{Z} \Big( \sum_{h_1} \exp(\langle \theta_{1,:}, A_{1,:}(x, h_1) \rangle) \Big) \cdots \Big( \sum_{h_m} \exp(\langle \theta_{m,:}, A_{m,:}(x, h_m) \rangle) \Big) \\
&= \frac{1}{Z} \big( Z_1 p^{(1)}(x) \big) \cdots \big( Z_m p^{(m)}(x) \big) = \frac{Z_1 \cdots Z_m}{Z}\, p^{(1)}(x) \cdots p^{(m)}(x) ,
\end{aligned}$$

where $p^{(j)} \in \mathcal{M}_{\mathcal{X},|\mathcal{Y}_j|}$ and $Z_j = \sum_{x \in \mathcal{X}} \sum_{h_j \in \mathcal{Y}_j} \exp(\langle \theta_{j,:}, A_{j,:}(x, h_j) \rangle)$ for all $j \in [m]$. Since the
vectors $\theta_{j,:}$ can be chosen arbitrarily, the factors $p^{(j)}$ can be made arbitrary within $\mathcal{M}_{\mathcal{X},|\mathcal{Y}_j|}$. □

Figure 3: Three 3-slicings of the 2-cube by the fan of the 2-simplex with maximal cones $R_0$, $R_1$,
and $R_2$. Each vertex of the 2-cube is a column vector of the sufficient statistics matrix of the 2-bit
independence model. Each vertex of the 2-simplex is a column vector of the sufficient statistics
matrix of the independence model of one single ternary variable (equal to $\Delta(\{0, 1, 2\})$).

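Of course, every distribution in $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a mixture distribution $p(x) = \sum_{h \in \mathcal{Y}} p(x \mid h) q(h)$. The
mixture weights are given by the marginals $q(h)$ on $\mathcal{Y}$ of distributions from $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$, and the mixture
components can be described as follows:

Proposition 9. The set of conditional distributions $p(\cdot \mid h)$, $h \in \mathcal{Y}$, of $\mathcal{E}_{\mathcal{X},\mathcal{Y}}$ is the set of product
distributions in $\mathcal{E}_\mathcal{X}$ with parameters $\theta_h = \Theta^\top A^{(\mathcal{Y})}_h$, $h \in \mathcal{Y}$, equal to a linear projection of the
vertices $\{A^{(\mathcal{Y})}_h : h \in \mathcal{Y}\}$ of the Cartesian product of simplices $Q_\mathcal{Y} \cong \Delta(\mathcal{Y}_1) \times \cdots \times \Delta(\mathcal{Y}_m)$.

Proof. This is by eq. (5). □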
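The factorization in Proposition 8 can be checked numerically on a small example. The sketch below uses the homogeneous parametrization with one table of weights per edge; the layout `theta[j][i]` of shape $(s_j, r_i)$ and both function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def discrete_rbm_marginal(theta, vis_card, hid_card):
    """Brute-force visible marginal; theta[j][i] has shape (s_j, r_i)."""
    X = list(itertools.product(*[range(r) for r in vis_card]))
    Y = list(itertools.product(*[range(s) for s in hid_card]))
    unnormalized = np.array([
        sum(np.exp(sum(theta[j][i][h[j], x[i]]
                       for j in range(len(hid_card))
                       for i in range(len(vis_card))))
            for h in Y)
        for x in X
    ])
    return unnormalized / unnormalized.sum()

def hadamard_of_experts(theta, vis_card, hid_card):
    """Renormalized entrywise product of one mixture of products per hidden unit."""
    X = list(itertools.product(*[range(r) for r in vis_card]))
    product = np.ones(len(X))
    for j, s in enumerate(hid_card):
        expert = np.array([
            sum(np.exp(sum(theta[j][i][hj, x[i]] for i in range(len(vis_card))))
                for hj in range(s))
            for x in X
        ])
        product *= expert / expert.sum()   # p^(j), a mixture of s products
    return product / product.sum()

rng = np.random.default_rng(2)
vis_card, hid_card = [3, 2], [2, 3]
theta = [[rng.normal(size=(s, r)) for r in vis_card] for s in hid_card]
p_rbm = discrete_rbm_marginal(theta, vis_card, hid_card)
p_had = hadamard_of_experts(theta, vis_card, hid_card)
print(np.max(np.abs(p_rbm - p_had)))
```

The two distributions coincide because the sum over joint hidden states of a product of per-unit terms equals the product of the per-unit sums.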

4 Products of simplices and their normal fans 

Binary RBMs have been analyzed by considering each of the $m$ hidden units as defining a
hyperplane $H_j$ slicing the $n$-cube into two regions. To generalize the results provided by this analysis,
in this section we replace the $n$-cube with a general product of simplices $Q_\mathcal{X}$, and replace the two
regions defined by the hyperplane $H_j$ by the $|\mathcal{Y}_j|$ regions defined by maximal cones of the normal
fan of the simplex $\Delta(\mathcal{Y}_j)$.

Subdivisions of independence models 

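The normal cone of a polytope $Q \subset \mathbb{R}^d$ at a point $x \in Q$ is the set of all vectors $v \in \mathbb{R}^d$ with
$\langle v, (x - y) \rangle \geq 0$ for all $y \in Q$. We denote by $R_x$ the normal cone of the product of simplices
$Q_\mathcal{X} = \operatorname{conv}\{A^{(\mathcal{X})}_x\}_{x \in \mathcal{X}}$ at the vertex $A^{(\mathcal{X})}_x$. The normal fan $\mathcal{F}_\mathcal{X}$ is the collection of all normal
cones of $Q_\mathcal{X}$.

The product distributions $p_\theta = \tfrac{1}{Z} \exp(\langle \theta, A^{(\mathcal{X})} \rangle) \in \mathcal{E}_\mathcal{X}$ strictly maximized at $x \in \mathcal{X}$, i.e., which
satisfy $p_\theta(x) > p_\theta(y)$ for all $y \in \mathcal{X} \setminus \{x\}$, are those with parameters $\theta$ in the relative interior of $R_x$.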

Inference functions and slicings 

For any choice of parameters of the model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$, there is an inference function $\pi : \mathcal{X} \to \mathcal{Y}$
(or more generally $\pi : \mathcal{X} \to 2^\mathcal{Y}$), which computes the most likely hidden state given a visible state.
These functions are not necessarily injective nor surjective. For a visible state $x$, the conditional
distribution on the hidden states is a product distribution $p(y \mid X = x) = \tfrac{1}{Z} \exp(\langle \Theta A^{(\mathcal{X})}_x, A^{(\mathcal{Y})}_y \rangle)$,
which is maximized at the $y$ for which $\Theta A^{(\mathcal{X})}_x \in R_y$. The preimages of the $R_y$ by $\Theta$ partition the
input space $\mathbb{R}^{d_\mathcal{X}}$, and are called inference regions. See Figure 3 and Example 11.
Definition 10. A $\mathcal{Y}$-slicing of a finite set $\mathcal{Z}$ is a partition of $\mathcal{Z}$ into the preimages of $R_y$, the maximal
cones of $\mathcal{F}_\mathcal{Y}$, by a linear map $\Theta$. We assume that $\Theta$ is generic, such that it maps each point in $\mathcal{Z}$ into
the interior of some $R_y$.

When $\mathcal{Y} = \{0, 1\}$, the fan $\mathcal{F}_\mathcal{Y}$ consists of a hyperplane and the two closed halfspaces defined by
that hyperplane. A $\mathcal{Y}$-slicing is in this case a standard slicing by a hyperplane.

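Example 11. Let $\mathcal{X} = \{0, 1, 2\} \times \{0, 1\}$ and $\mathcal{Y} = \{0, 1\}^4$. The maximal cones $R_y$ of the normal
fan of the 4-cube with vertices $\{0, 1\}^4$ are the closed orthants of $\mathbb{R}^4$. The 6 vertices $\{A^{(\mathcal{X})}_x : x \in \mathcal{X}\}$
of the prism $\Delta_2 \times \Delta_1$ can be mapped into 6 distinct orthants of $\mathbb{R}^4$, each with an even number of
positive coordinates:

$$\Theta A^{(\mathcal{X})} = \begin{pmatrix} -1 & -1 & 1 & 1 & 1 & 3 \\ 1 & 1 & 3 & -1 & -1 & 1 \\ -3 & 1 & -1 & -1 & 3 & 1 \\ 1 & -3 & -1 & 3 & -1 & 1 \end{pmatrix} . \tag{11}$$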
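The claim of Example 11 can be checked mechanically. The sketch below rests on two assumptions: the entries of the $4 \times 6$ image matrix are taken as they can best be read from eq. (11), and the columns are assumed to be ordered as $(x_1, x_2) = (0,0), (1,0), (2,0), (0,1), (1,1), (2,1)$. Under these assumptions the code confirms that the six columns lie in six distinct orthants with an even number of positive coordinates, and it recovers a linear map `Theta` sending the prism vertices to these columns.

```python
import numpy as np

# Image of the six prism vertices, as read from eq. (11) (assumed entries).
R = np.array([
    [-1, -1,  1,  1,  1, 3],
    [ 1,  1,  3, -1, -1, 1],
    [-3,  1, -1, -1,  3, 1],
    [ 1, -3, -1,  3, -1, 1],
])

# Sufficient statistics of {0,1,2} x {0,1} in the assumed column order.
A = np.array([
    [1, 1, 1, 1, 1, 1],   # constant function
    [0, 1, 0, 0, 1, 0],   # indicator of x1 = 1
    [0, 0, 1, 0, 0, 1],   # indicator of x1 = 2
    [0, 0, 0, 1, 1, 1],   # indicator of x2 = 1
])

orthants = {tuple((col > 0).tolist()) for col in R.T}
positives = (R > 0).sum(axis=0)

# Solve Theta @ A = R in the least-squares sense; the fit is exact here.
Theta = np.linalg.lstsq(A.T, R.T, rcond=None)[0].T

print(len(orthants), positives)
print(np.round(Theta, 6))
```

An exact fit means that the columns of `R` are indeed a linear image of the prism vertices, so the sign patterns are realized by a single linear map as the example asserts.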



Even in the case of one hidden unit, the slicings can be complex, but the following simple type of 
slicing is always available. 

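Proposition 12. Any slicing by $k - 1$ parallel hyperplanes is a $\{1, 2, \ldots, k\}$-slicing.

Proof. We show that there is a line $\mathcal{L} = \{\lambda r - b : \lambda \in \mathbb{R}\}$, $r, b \in \mathbb{R}^k$, intersecting all cells of $\mathcal{F}_\mathcal{Y}$,
$\mathcal{Y} = \{1, \ldots, k\}$. We need to show that there is a choice of $r$ and $b$ such that for every $y \in \mathcal{Y}$ the set
$I_y \subseteq \mathbb{R}$ of all $\lambda$ with $\langle \lambda r - b, (e_y - e_z) \rangle > 0$ for all $z \in \mathcal{Y} \setminus \{y\}$ has a non-empty interior. Now, $I_y$
is the set of $\lambda$ with

$$\lambda (r_y - r_z) > b_y - b_z \quad \text{for all } z \neq y . \tag{12}$$

Choosing $b_1 < \cdots < b_k$ and $r_y = f(b_y)$, where $f$ is a strictly increasing and strictly concave
function, we get $I_1 = \big(-\infty, \tfrac{b_2 - b_1}{r_2 - r_1}\big)$, $I_y = \big(\tfrac{b_y - b_{y-1}}{r_y - r_{y-1}}, \tfrac{b_{y+1} - b_y}{r_{y+1} - r_y}\big)$ for $y = 2, 3, \ldots, k - 1$, and
$I_k = \big(\tfrac{b_k - b_{k-1}}{r_k - r_{k-1}}, \infty\big)$. The lengths $\infty, l_2, \ldots, l_{k-1}, \infty$ of the intervals $I_1, \ldots, I_k$ can be adjusted arbitrarily
by choosing suitable differences $r_{j+1} - r_j$ for all $j = 1, \ldots, k - 1$. □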
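The construction in the proof of Proposition 12 can be illustrated with a concrete concave function. The sketch below assumes $f = \sqrt{\cdot}$ (strictly increasing and strictly concave on the positive reals) and 0-based cell indices; both function names are hypothetical. A point $\lambda r - b$ lies in the cone $R_y$ of the normal fan of the simplex exactly when its $y$-th coordinate is the largest.

```python
import numpy as np

def slicing_breakpoints(b, f=np.sqrt):
    """Endpoints of the intervals I_1, ..., I_k along the line {lam * r - b}.

    b must be strictly increasing (and positive for the default f = sqrt).
    The y-th breakpoint is (b[y+1] - b[y]) / (r[y+1] - r[y]) with r = f(b).
    """
    b = np.asarray(b, dtype=float)
    r = f(b)
    return (b[1:] - b[:-1]) / (r[1:] - r[:-1])

def cell_of(lam, b, f=np.sqrt):
    """0-based index y of the largest coordinate of lam * f(b) - b."""
    b = np.asarray(b, dtype=float)
    return int(np.argmax(lam * f(b) - b))

b = [1.0, 4.0, 9.0, 16.0]
breaks = slicing_breakpoints(b)
print(breaks)                                   # increasing breakpoints
print([cell_of(lam, b) for lam in (0.0, 4.0, 6.0, 8.0)])
```

For this choice of `b` the breakpoints are increasing, so the line passes through all four cells in order as $\lambda$ grows, which is the parallel-hyperplane slicing of the proposition.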

Strong modes 

A strong mode of a distribution $p$ on $\mathcal{X}$ is a point $x \in \mathcal{X}$ such that $p(x) > \sum_{y \in \mathcal{X} : d_H(x, y) = 1} p(y)$,
where $d_H(x, y)$ is the Hamming distance between $x$ and $y$.

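Lemma 13. Let $\mathcal{C} \subseteq \mathcal{X}$ be a set of arrays which are pairwise different in at least two entries (a code
of minimum distance two). If $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ contains a probability distribution with strong modes $\mathcal{C}$, then
there is a linear map $\Theta$ of $\{A^{(\mathcal{Y})}_y\}_{y \in \mathcal{Y}}$ into the $\mathcal{C}$-cells (the cones above the codewords in the normal fan)
of $\mathcal{F}_\mathcal{X}$ sending at least one vertex into each cell.

On the other hand, let $\mathcal{C} \subseteq \mathcal{X}$. If there is a linear map $\Theta$ of the vertices of $\times_{j \in [m]} \Delta(\mathcal{Y}_j)$ into the
$\mathcal{C}$-cells $R_x$, $x \in \mathcal{C}$, of the fan $\mathcal{F}_\mathcal{X}$, with $\max_x \{\langle \Theta^\top A^{(\mathcal{Y})}_y, A^{(\mathcal{X})}_x \rangle\} = c$ for all $y$, then $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ contains
a probability distribution with strong modes $\mathcal{C}$.

Proof. This is by Proposition 9, Lemma 4, and the definition of the normal fan. □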

By this lemma, if $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a universal approximator of distributions on $\mathcal{X}$, then $|\mathcal{Y}| \geq \mathfrak{A}(\mathcal{X}, 2)$.
Hence discrete RBMs may not be universal approximators even when they have the same dimension
as the ambient probability simplex.

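Example 14. Let $\mathcal{X} = \{0, 1, 2\}^n$ and $\mathcal{Y} = \{0, 1, \ldots, 4\}^m$. In this case $\mathfrak{A}(\mathcal{X}, 2) = 3^{n-1}$. When
$n = 3$ or $n = 4$, if $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a universal approximator, then $m \geq 2$ and $m \geq 3$, respectively. On the other
hand, the smallest $m$ for which $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has $3^n - 1$ parameters is $m = 1$ and $m = 2$, respectively.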
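The numbers in Example 14 follow from two inequalities: the necessary condition $|\mathcal{Y}| = 5^m \geq 3^{n-1}$ for universal approximation, and the parameter count $(1 + 2n)(1 + 4m) - 1 \geq 3^n - 1$. A short sketch (function names are illustrative) reproduces them:

```python
def min_hidden_for_strong_modes(n, q=3, s=5):
    """Smallest m with s**m >= q**(n-1), i.e. |Y| >= A(X, 2)."""
    m = 0
    while s ** m < q ** (n - 1):
        m += 1
    return m

def min_hidden_for_dimension(n, q=3, s=5):
    """Smallest m with (1 + n(q-1)) * (1 + m(s-1)) - 1 >= q**n - 1."""
    m = 0
    while (1 + n * (q - 1)) * (1 + m * (s - 1)) - 1 < q ** n - 1:
        m += 1
    return m

for n in (3, 4):
    print(n, min_hidden_for_strong_modes(n), min_hidden_for_dimension(n))
```

For both values of $n$, the number of hidden units needed for universal approximation exceeds the number needed to match the dimension of the simplex by parameter counting.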



Using the analysis of [23] gives the following.

Proposition 15. If $4 \lceil m/3 \rceil \leq n$, then $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ contains distributions with $2^m$ strong modes.
5 Approximation errors and universal approximation



In this section we describe certain explicit tractable submodels of the discrete RBM and use these to
provide error bounds.

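Theorem 16. Let $d_\mathcal{Y} = 1 + \sum_{j=1}^{m} (|\mathcal{Y}_j| - 1)$. The model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ can approximate the following
arbitrarily well:

• Any mixture of $d_\mathcal{Y}$ product distributions with disjoint supports.

• When $d_\mathcal{Y} \geq \big(\prod_{i \in [k]} |\mathcal{X}_i|\big) / \max_{j \in [k]} |\mathcal{X}_j|$, any distribution from the set $\mathcal{P}$ of distributions
with constant value on each block $\{x_1\} \times \cdots \times \{x_k\} \times \mathcal{X}_{k+1} \times \cdots \times \mathcal{X}_n$ for all $x_i \in \mathcal{X}_i$,
for all $i \in [k]$.

• Any probability distribution with support contained in the union of $d_\mathcal{Y}$ sets of the form
$\{x_1\} \times \cdots \times \{x_{k-1}\} \times \mathcal{X}_k \times \{x_{k+1}\} \times \cdots \times \{x_n\}$.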

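Proof. By Proposition 8 the RBM model contains any product $p^{(1)} \circ \cdots \circ p^{(m)}$, where $p^{(j)} \in
\mathcal{M}_{\mathcal{X},|\mathcal{Y}_j|}$ for all $j \in [m]$. In particular, it contains $p = p^{(0)} \circ (\mathbb{1} + \lambda_1 \tilde{p}^{(1)}) \circ \cdots \circ (\mathbb{1} + \lambda_m \tilde{p}^{(m)})$,
where $p^{(0)} \in \mathcal{E}_\mathcal{X}$ and $\tilde{p}^{(j)} \in \mathcal{M}_{\mathcal{X},|\mathcal{Y}_j| - 1}$. Choosing the factors $\tilde{p}^{(j)}$ with disjoint supports results
in $p = \sum_{j=0}^{m} \lambda_j p^{(j)}$, where $p^{(0)}$ is any product distribution and $p^{(j)} \in \mathcal{M}_{\mathcal{X},|\mathcal{Y}_j| - 1}$ can be made
arbitrary for all $j \in [m]$, as long as $\operatorname{supp}(p^{(j)}) \cap \operatorname{supp}(p^{(j')}) = \emptyset$ for all $j \neq j'$.

The second item: Any point in $\mathcal{P}$ is a mixture of the uniform distributions $p_{x_1, \ldots, x_k}$ on the blocks
$\{x_1\} \times \cdots \times \{x_k\} \times \mathcal{X}_{k+1} \times \cdots \times \mathcal{X}_n$. These mixture components have disjoint supports and are
product distributions, since they factorize as $p_{x_1, \ldots, x_k} = \prod_{i \in [k]} \delta_{x_i} \prod_{i \in [n] \setminus [k]} u_i$, where $\delta_{x_i}$ denotes the
point measure at $x_i$ and $u_i$ denotes the uniform distribution on $\mathcal{X}_i$. For any $j \in [k]$, any mixture of the
form $\sum_{x_j \in \mathcal{X}_j} \lambda_{x_j} p_{x_1, \ldots, x_k}$ is also a product distribution, which factorizes as

$$\Big( \sum_{x_j \in \mathcal{X}_j} \lambda_{x_j} \delta_{x_j} \Big) \prod_{i \in [k] \setminus \{j\}} \delta_{x_i} \prod_{i \in [n] \setminus [k]} u_i . \tag{13}$$

Hence any point in $\mathcal{P}$ is a mixture of $\big(\prod_{i \in [k]} |\mathcal{X}_i|\big) / \max_{j \in [k]} |\mathcal{X}_j|$ product distributions with disjoint
supports.

The third item follows from the first item, because $\overline{\mathcal{E}_\mathcal{X}}$ contains any distribution with support of the
form $\{x_1\} \times \cdots \times \{x_{k-1}\} \times \mathcal{X}_k \times \{x_{k+1}\} \times \cdots \times \{x_n\}$. See [21]. □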

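Let $p, q \in \Delta(\mathcal{X})$. The Kullback-Leibler (KL) divergence from $q$ to $p$ is defined as $D(q \| p) :=
\sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)}$ when $\operatorname{supp}(p) \supseteq \operatorname{supp}(q)$ and as $D(q \| p) = \infty$ otherwise. The divergence
from $p$ to a model $\mathcal{M} \subseteq \Delta(\mathcal{X})$ is defined as $D(p \| \mathcal{M}) := \inf_{q \in \mathcal{M}} D(p \| q)$. A model $\mathcal{M}$ of
distributions on $\mathcal{X}$ is a universal approximator iff $D(p \| \mathcal{M}) = 0$ for all $p \in \Delta(\mathcal{X})$.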
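The definition of the KL divergence, including the support condition, translates into a few lines of code. This is a minimal sketch with a hypothetical function name; terms with $q(x) = 0$ are taken to contribute zero, following the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def kl_divergence(q, p):
    """D(q || p) = sum_x q(x) log(q(x) / p(x)).

    Returns infinity unless the support of q is contained in the support of p.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if np.any((q > 0) & (p == 0)):
        return np.inf
    support = q > 0
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))   # point mass against uniform
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))   # support condition violated
```

The second call evaluates to $\log 2$, the logarithm of the size of the block on which the second argument is uniform; this is the mechanism behind the partition-model bound used later in this section.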

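Theorem 17. Let $\Lambda \subseteq [n]$. If $\prod_{i \in [n] \setminus \Lambda} |\mathcal{X}_i| \leq 1 + \sum_{j \in [m]} (|\mathcal{Y}_j| - 1) = d_\mathcal{Y}$, then the KL-divergence
from any $p \in \Delta(\mathcal{X})$ to the model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is bounded by

$$D(p \,\|\, \mathrm{RBM}_{\mathcal{X},\mathcal{Y}}) \leq \log \frac{\prod_{i \in \Lambda} |\mathcal{X}_i|}{\max_{i \in \Lambda} |\mathcal{X}_i|} .$$

In particular, $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ is a universal approximator whenever $d_\mathcal{Y} \geq |\mathcal{X}| / \max_{i \in [n]} |\mathcal{X}_i|$.

Proof. The set $\mathcal{P}$ described in the second item of Theorem 16 is known as a partition model. The
maximal divergence from such a model is equal to the logarithm of the cardinality of the largest
block. See [25]. We have thus $\max_p D(p \| \mathrm{RBM}_{\mathcal{X},\mathcal{Y}}) \leq \max_p D(p \| \mathcal{P}) = \log \frac{\prod_{i \in \Lambda} |\mathcal{X}_i|}{\max_{i \in \Lambda} |\mathcal{X}_i|}$. □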

This theorem tells us that the maximal approximation error increases at most logarithmically with 
the total number of visible states, and decreases at least logarithmically with the sum of the number 
of states of the hidden units. This observation could be helpful, for example, in designing a penalty 
term to allow comparison of models with differing numbers of units. 
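The bound of Theorem 17 can be evaluated for given cardinalities by searching over the subsets $\Lambda$. The sketch below assumes that the bound reads $\log\big(\prod_{i \in \Lambda} |\mathcal{X}_i| / \max_{i \in \Lambda} |\mathcal{X}_i|\big)$ and, as a convention of this example, assigns the value 0 to the empty set $\Lambda$ (which is admissible only when $|\mathcal{X}| \leq d_\mathcal{Y}$). The function name `kl_bound` is hypothetical, and the exhaustive search is only practical for a small number of visible units.

```python
import itertools
import math

def kl_bound(vis_card, hid_card):
    """Smallest bound of Theorem 17 over all admissible subsets Lambda.

    vis_card: list of |X_i|; hid_card: list of |Y_j|.
    Lambda is admissible if prod_{i not in Lambda} |X_i| <= d_Y.
    """
    n = len(vis_card)
    d_Y = 1 + sum(s - 1 for s in hid_card)
    best = math.inf
    for size in range(n + 1):
        for lam in itertools.combinations(range(n), size):
            rest = math.prod(vis_card[i] for i in range(n) if i not in lam)
            if rest <= d_Y:
                block = [vis_card[i] for i in lam]
                bound = math.log(math.prod(block) / max(block)) if block else 0.0
                best = min(best, bound)
    return best

print(kl_bound([2, 2, 2, 2], [2]))      # four binary visibles, one binary hidden
print(kl_bound([2, 2, 2], [2, 2, 2]))   # three binary visibles, three binary hiddens
print(kl_bound([3, 3], [3]))            # naive Bayes model with three classes
```

The first call gives $\log 4$, and the last two give 0, in line with the universal approximation thresholds stated in Theorems 6 and 3.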
Remark 18. The submodels of discrete RBMs described in Theorem 16 can also be used to upper
bound the expected Kullback-Leibler divergence from $q$ to $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ when $q$ is drawn from a prior
on the probability simplex $\Delta(\mathcal{X})$. The expectation value of the divergence for such submodels and
any Dirichlet priors has been computed in [24].



6 Dimension 



In this section we study the dimension of the models $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$. Our analysis builds on previous
work by Cueto, Morton, and Sturmfels [7], where the binary case was treated. The idea is to
bound the dimension from below by the dimension of a related max-plus model, called the tropical
RBM [27], and from above by the dimension expected from counting parameters. One reason
RBMs are attractive is that they have a large learning capacity; for example, they may be built with millions of
parameters. Dimension calculations show whether those parameters are wasted, or translate into
higher-dimensional spaces of representable distributions.

The dimension of the discrete RBM model can be bounded from above not only by its expected 
dimension, but also by a function of the dimension of its Hadamard factors: 

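Proposition 19. The dimension of the discrete RBM is bounded as

$$\dim(\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}) \leq \dim(\mathcal{M}_{\mathcal{X},|\mathcal{Y}_i|}) + \sum_{j \in [m] \setminus \{i\}} \dim(\mathcal{M}_{\mathcal{X},|\mathcal{Y}_j| - 1}) + (m - 1) \quad \text{for all } i \in [m] .$$

Hence $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ can have the expected dimension only if (i) the right hand side equals $|\mathcal{X}| - 1$, or
(ii) each mixture model $\mathcal{M}_{\mathcal{X},k}$ has the expected dimension for all $k = 1, \ldots, \max_j |\mathcal{Y}_j|$.

Proof. Note that $\mathcal{E}_\mathcal{X} \circ \mathcal{E}_\mathcal{X} = \mathcal{E}_\mathcal{X}$ and hence $\mathcal{E}_\mathcal{X} \circ \mathcal{M}_{\mathcal{X},k} = \mathcal{M}_{\mathcal{X},k}$. Let $u$ denote the uniform
distribution. By Proposition 8, for each $i \in [m]$,

$$\mathrm{RBM}_{\mathcal{X},\mathcal{Y}} = \mathcal{M}_{\mathcal{X},|\mathcal{Y}_i|} \circ \mathop{\bigcirc}_{j \in [m] \setminus \{i\}} \big( \lambda_j u + (1 - \lambda_j) \mathcal{M}_{\mathcal{X},|\mathcal{Y}_j| - 1} \big) ,$$

from which the claim follows. □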

Example 20. Consider an RBM with only two visible variables. The set of $M \times N$ matrices of rank
at most $k$ has dimension $k(M + N - k)$ for all $1 \leq k \leq \min\{M, N\}$. Hence the $k$-mixture of the
independence model of two variables has dimension less than the number of parameters whenever
$1 < k < \min\{|\mathcal{X}_1|, |\mathcal{X}_2|\}$.

By Proposition 19, if $\big(\sum_{j \in [m]} (|\mathcal{Y}_j| - 1) + 1\big)(|\mathcal{X}_1| + |\mathcal{X}_2| - 1) \leq |\mathcal{X}_1 \times \mathcal{X}_2|$ and $1 < |\mathcal{Y}_j| <
\min\{|\mathcal{X}_1|, |\mathcal{X}_2|\}$ for some $j \in [m]$, then $\mathrm{RBM}_{\mathcal{X}_1 \times \mathcal{X}_2, \mathcal{Y}}$ does not have the expected dimension.
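The dimension deficit in Example 20 can be made concrete. The sketch below assumes that the dimension of the $k$-mixture of two variables is the rank-variety dimension $k(M + N - k)$ minus one for normalization (this subtraction is an inference made for this example, not a statement from the text), and compares it with the dimension expected from counting parameters. Function names are illustrative.

```python
def mixture_dim_two_vars(M, N, k):
    """Dimension of the k-mixture for two variables with M and N states.

    Uses the dimension k(M + N - k) of the matrices of rank at most k,
    minus one for the normalization of joint probability tables.
    """
    k = min(k, M, N)
    return k * (M + N - k) - 1

def mixture_expected_dim(M, N, k):
    """Expected dimension: min(parameter count, dimension of the simplex)."""
    params = k * ((M - 1) + (N - 1)) + (k - 1)
    return min(params, M * N - 1)

for M, N, k in [(3, 3, 2), (4, 4, 2), (4, 4, 3), (4, 5, 3)]:
    print(M, N, k, mixture_dim_two_vars(M, N, k), mixture_expected_dim(M, N, k))
```

In every listed case with $1 < k < \min\{M, N\}$, the actual dimension is strictly smaller than the expected one; the gap in the parameter count is $k(k - 1)$.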

We say that a set $\mathcal{Z} \subseteq \mathcal{X}$ is full rank when the matrix with columns $\{A^{(\mathcal{X})}_x : x \in \mathcal{Z}\}$ has full rank.

Theorem 21. The model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has dimension $\big(1 + \sum_{i \in [n]} (|\mathcal{X}_i| - 1)\big)\big(1 + \sum_{j \in [m]} (|\mathcal{Y}_j| - 1)\big) - 1$,
as expected from counting parameters, whenever $\mathcal{X}$ contains $m$ disjoint Hamming balls of radii
$2(|\mathcal{Y}_j| - 1) - 1$, $j \in [m]$, and the subset of $\mathcal{X}$ not contained in these balls has full rank. On the other
hand, if $m$ Hamming balls of radius one cover $\mathcal{X}$, then $\dim(\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}) = |\mathcal{X}| - 1$.

In order to prove this theorem we will need two main tools: slicings by normal fans of simplices
(Section 4), and the tropical RBM model, described in Section 7. The theorem will follow from the
analysis contained in the next section.



7 Tropical model 

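Let $p_\theta(v) = \sum_h p_\theta(v, h) = \frac{1}{Z} \sum_h \exp(\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(v,h)} \rangle)$, $\theta \in \mathbb{R}^d$, be a parametrization of $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$.

Definition 22. The tropical model $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ is the image of the tropical morphism $\Phi$, which
evaluates $\log(p_\theta(v)) = \log(\sum_h p_\theta(v, h))$ for all $v \in \mathcal{X}$ and $\theta \in \mathbb{R}^d$ within the max-plus algebra
(addition becomes $a + b = \max\{a, b\}$) and only up to additive constants independent of $v$ (i.e.,
disregarding the normalization constant $Z$).
The idea behind this definition is that $\log(\exp(a) + \exp(b)) \approx \max\{a, b\}$ when $a$ and $b$ have a
different order of magnitude. We have

$$\Phi(v; \theta) = \max\{\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(v,h)} \rangle : h \in \mathcal{Y}\} \quad \text{for all } v \in \mathcal{X},\ \theta \in \mathbb{R}^d . \tag{14}$$

The tropical model captures important properties of the original model. Of particular interest is the
relation

$$\dim(\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}) \leq \dim(\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}) \leq \min\{\dim(\mathcal{E}_{\mathcal{X},\mathcal{Y}}), |\mathcal{X}| - 1\} , \tag{15}$$

which gives us a tool to estimate the dimension of the discrete RBM model.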
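The approximation that motivates the tropical model can be observed numerically: as the parameters are scaled up, the logarithm of a sum of exponentials approaches the maximum of the exponents. In the sketch below the array `scores` is a hypothetical set of values $\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(v,h)} \rangle$ for a fixed $v$ and three hidden states, and the function names are illustrative.

```python
import numpy as np

def log_marginal(scores):
    """log sum_h exp(score_h), computed stably by factoring out the maximum."""
    top = np.max(scores)
    return top + np.log(np.sum(np.exp(scores - top)))

def tropical_marginal(scores):
    """Max-plus counterpart: the sum over h is replaced by a maximum."""
    return np.max(scores)

scores = np.array([1.0, -2.0, 0.5])   # hypothetical inner products for one v
gaps = [log_marginal(t * scores) - tropical_marginal(t * scores)
        for t in (1, 10, 100)]
print(gaps)
```

The gap is always between 0 and the logarithm of the number of hidden states, and it shrinks as the scale factor grows.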



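The following Theorem 23 describes the regions of linearity of $\Phi$. This allows us to express the
dimension of $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ as the maximum rank of a class of matrices defined by $\mathcal{Y}_j$-slicings (see
Definition 10) of the set $\{A^{(\mathcal{X})}_x\}_{x \in \mathcal{X}}$ for all $j \in [m]$.

For each $j \in [m]$, let $C_j = \{C_{j,1}, \ldots, C_{j,|\mathcal{Y}_j|}\}$ be a $\mathcal{Y}_j$-slicing of $\{A^{(\mathcal{X})}_x : x \in \mathcal{X}\}$. Let $A_{C_{j,k}}$ be the transpose of
$A^{(\mathcal{X})}$ with the columns corresponding to points not in $C_{j,k}$ zeroed, and $A_{C_j} = (A_{C_{j,1}} | \cdots | A_{C_{j,|\mathcal{Y}_j|}})$.
The matrix $A_{C_j}$ is $|\mathcal{X}| \times |\mathcal{Y}_j| d_\mathcal{X}$. Let $d = \sum_{j \in [m]} |\mathcal{Y}_j| d_\mathcal{X}$.

Theorem 23. On each region of linearity corresponding to a collection of $m$ $\mathcal{Y}_j$-slicings, the tropical
morphism $\Phi : \mathbb{R}^d \to \mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ is the linear map represented by the $|\mathcal{X}| \times d$-matrix

$$A = (A_{C_1} | \cdots | A_{C_m}) ,$$

modulo constant functions. In particular $\dim(\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}) + 1$ is the maximum rank of $A$ over all
possible collections of $m$ $\mathcal{Y}_j$-slicings.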
possible collections ofm yj-slicings. 

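Proof. Again use the homogeneous version of the matrix $A^{(\mathcal{X},\mathcal{Y})}$ as in the proof of Proposition 8;
this will not affect the rank of $A$. Let $\theta_{h_j} = (\theta_{\{j,i\},(h_j, x_i)})_{i \in [n], x_i \in \mathcal{X}_i}$ and denote by $A_{h_j}$ the
submatrix of $A^{(\mathcal{X},\mathcal{Y})}$ containing the rows with indices $\{(\{j, i\}, (h_j, x_i)) : i \in [n], x_i \in \mathcal{X}_i\}$. For a
given $v \in \mathcal{X}$ we have

$$\max\{\langle \theta, A^{(\mathcal{X},\mathcal{Y})}_{(v,h)} \rangle : h \in \mathcal{Y}\} = \sum_{j \in [m]} \max\{\langle \theta_{h_j}, A_{h_j}(v, h_j) \rangle : h_j \in \mathcal{Y}_j\} . \quad \square$$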

In the following we evaluate the maximum rank of the matrix A for various choices of X and y by 
examining good slicings. 

Lemma 24. For any $x^* \in \mathcal{X}$ and $0 < k < n$ the affine hull of the set $\{A^{(\mathcal{X})}_x : d_H(x, x^*) = k\}$ has
dimension $\sum_{i \in [n]} (|\mathcal{X}_i| - 1) - 1$.

Proof. The set $Z^k := \{A^{(\mathcal{X})}_x : d_H(x, x^*) = k\}$ is the subset of vertices of the product of simplices
$Q_\mathcal{X}$ contained in the hyperplane $H^k := \{z : \langle \mathbb{1}, z \rangle = k + 1\}$. We have that $\operatorname{conv}(Z^k) = Q_\mathcal{X} \cap H^k$,
because if not, $H^k$ would slice an edge of $Q_\mathcal{X}$. On the other hand the two vertices of any edge
of $Q_\mathcal{X}$ lie in two parallel hyperplanes $H^l$ and $H^{l+1}$, and hence $\langle \mathbb{1}, z \rangle \notin \mathbb{N}$ for any point $z$ in the
relative interior of an edge of $Q_\mathcal{X}$. The set $Z^k$ is not contained in any proper face of $Q_\mathcal{X}$ and hence
$\operatorname{conv}(Z^k)$ intersects the interior of $Q_\mathcal{X}$. Thus $\dim(\operatorname{conv}(Z^k)) = \dim(Q_\mathcal{X}) - 1$, as was claimed. □

If $\mathcal{Z}$ is a radius-one Hamming ball in $\mathcal{X}$, then the $1 + \sum_{i \in [n]} (|\mathcal{X}_i| - 1)$ vectors $A^{(\mathcal{X})}_x$, $x \in \mathcal{Z}$, are
affinely independent. Lemma 24 implies the following.

Corollary 25. Let $x \in \mathcal{X}$, and $2k - 3 \leq n$. There is a slicing $C_1 = \{C_{1,1}, \ldots, C_{1,k}\}$ of $\mathcal{X}$ by
$k - 1$ parallel hyperplanes such that $\cup_{l=1}^{k-1} C_{1,l} = B_x(2k - 3)$ is the Hamming ball of radius $2k - 3$
centered at $x$, and $A_{C_1} = (A_{C_{1,1}} | \cdots | A_{C_{1,k-1}})$ has full rank.

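Recall that $\mathfrak{A}(\mathcal{X}, d)$ denotes the maximal cardinality of a subset of $\mathcal{X}$ of minimum Hamming distance
at least $d$. When $\mathcal{X} = \{0, 1, \ldots, q - 1\}^n$ we write $\mathfrak{A}_q(n, d)$. Let $\mathfrak{K}(\mathcal{X}, d)$ denote the minimal
cardinality of a subset of $\mathcal{X}$ with covering radius $d$.

Proposition 26.a (Binary visible units). Let $\mathcal{X} = \{0, 1\}^n$ and $|\mathcal{Y}_j| = s_j$, $j \in [m]$. If $\mathcal{X}$ contains
$m$ disjoint Hamming balls of radii $2 s_j - 3$, $j \in [m]$, whose complement has maximum rank, then
$\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has the expected dimension, $\min\{\sum_{j \in [m]} (s_j - 1)(n + 1) + n, 2^n - 1\}$.

In particular, if $\mathcal{Y} = \{0, 1, \ldots, s - 1\}^m$ and $m < \mathfrak{A}_2(n, d)$, $d = 4(s - 1) - 1$, then $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has
the expected dimension. It is known that $\mathfrak{A}_2(n, d) \geq 2^{n - \lceil \log_2 (\sum_{j=0}^{d-2} \binom{n-1}{j}) \rceil}$.

Proposition 26.b (Binary hidden units). Let $\mathcal{Y} = \{0, 1\}^m$.

• If $m + 1 \leq \mathfrak{A}(\mathcal{X}, 3)$, then $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has dimension $(1 + m)(1 + \sum_{i \in [n]} (|\mathcal{X}_i| - 1)) - 1$.

• If $m + 1 \geq \mathfrak{K}(\mathcal{X}, 1)$, then $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has dimension $|\mathcal{X}| - 1$.

Let $\mathcal{Y} = \{0, 1\}^m$ and $\mathcal{X} = \{0, 1, \ldots, q - 1\}^n$, where $q$ is a prime power.

• If $m + 1 \leq q^{n - \lceil \log_q(1 + (n - 1)(q - 1) + 1) \rceil}$, then $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has dimension
$(1 + m)(1 + \sum_{i \in [n]} (|\mathcal{X}_i| - 1)) - 1$.

• If $n = (q^r - 1)/(q - 1)$ for some $r \geq 2$, then $\mathfrak{A}(\mathcal{X}, 3) = \mathfrak{K}(\mathcal{X}, 1)$, and $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has the
expected dimension for any $m$.

When both hidden and visible units are binary and $m < 2^{n - \lceil \log_2(n+1) \rceil}$, then $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has the
expected dimension.

Proposition 26.c (Arbitrary sized units). If $\mathcal{X}$ contains $m$ disjoint Hamming balls of radii
$2|\mathcal{Y}_1| - 3, \ldots, 2|\mathcal{Y}_m| - 3$, and the complement of their union has full rank, then $\mathrm{RBM}^{\mathrm{tropical}}_{\mathcal{X},\mathcal{Y}}$ has the expected
dimension.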



Proof. Propositions 26.a, 26.b, and 26.c follow from Theorem 23 and Corollary 25 together with
the following explicit bounds on $\mathfrak{A}$.

The $q$-ary Hamming codes are perfect linear codes over the finite field $\mathbb{F}_q$ of length $n = (q^r -
1)/(q - 1)$, minimum Hamming distance three, and covering radius one.

Furthermore, if $q$ is a prime power, $\mathfrak{A}_q(n, d) \geq q^k$, where $k$ is the
largest integer with $q^k < q^n / \sum_{j=0}^{d-2} \binom{n-1}{j} (q - 1)^j$ (Gilbert-Varshamov [11, 34]). In particular, $\mathfrak{A}_2(n, 3) \geq
2^k$, where $k$ is the largest integer with $2^k < 2^n / n = 2^{n - \log_2(n)}$, i.e., $k = n - \lceil \log_2(n + 1) \rceil$. □
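The closed form $k = n - \lceil \log_2(n + 1) \rceil$ at the end of the proof can be compared with a direct search for the largest admissible exponent. The sketch below assumes that the Gilbert-Varshamov condition reads $q^k < q^n / \sum_{j=0}^{d-2} \binom{n-1}{j} (q-1)^j$; the function name `gv_exponent` is hypothetical, and integer arithmetic is used throughout to avoid rounding issues.

```python
import math

def gv_exponent(n, d, q=2):
    """Largest integer k with q**k * V < q**n, where
    V = sum_{j=0}^{d-2} C(n-1, j) * (q-1)**j (assumed form of the bound)."""
    volume = sum(math.comb(n - 1, j) * (q - 1) ** j for j in range(d - 1))
    k = n
    while q ** k * volume >= q ** n:
        k -= 1
    return k

for n in range(1, 12):
    closed_form = n - n.bit_length()   # n.bit_length() equals ceil(log2(n + 1))
    print(n, gv_exponent(n, 3), closed_form)
```

For $d = 3$ and $q = 2$ the sum equals $n$, and the search agrees with the closed form for every $n$ in the printed range.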

Example 27. Many results in coding theory can now be translated directly to a statement about
the dimension of discrete RBMs. Here is an example. Let $\mathcal{X} = \{1, 2, \ldots, s\} \times \{1, 2, \ldots, s\} \times
\{1, 2, \ldots, t\}$, $s \leq t$. The minimum cardinality $\mathfrak{K}(\mathcal{X}, 1)$ of a code $\mathcal{C} \subseteq \mathcal{X}$ with covering radius one
equals $s^2 - \lfloor (3s - t)^2 / 8 \rfloor$ if $t \leq 3s$, and $s^2$ otherwise, see [6, Theorem 3.7.4]. Hence $\mathrm{RBM}_{\mathcal{X},\{0,1\}^m}$
has dimension $|\mathcal{X}| - 1$ when $m + 1 \geq s^2 - \lfloor (3s - t)^2 / 8 \rfloor$ and $t \leq 3s$, and when $m + 1 \geq s^2$ and $t > 3s$.



8 Discussion 



We generalize various theoretical results on binary RBMs and naive Bayes models to the more
general class of discrete RBMs. We highlight and contrast geometric and combinatorial properties
of distributed products of experts and non-distributed mixtures of experts. In particular, we estimate
the number of hidden units for which these models are universal approximators, depending on the
cardinalities of the state spaces of all units. Moreover, we show that the maximal Kullback-Leibler
approximation errors of these models are bounded from above by the expression
$\log(\prod_{i \in [n]} |\mathcal{X}_i|) - \log(\max_i |\mathcal{X}_i|) - \log(\sum_{j \in [m]} (|\mathcal{Y}_j| - 1))$, and vanish when the expression is not positive
(Theorem 17). This generalizes [25, Theorem 5.1], which states that the maximal divergence to the
binary model decreases at least logarithmically with the number of hidden units. It also shows that
the maximal approximation error decreases at least logarithmically in the number of states of each
hidden node. We computed exponential subfamilies of $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ which can be used to estimate the
expected approximation errors as well. We discussed inference functions of these models in terms
of normal fans of products of simplices.

We discuss the combinatorics of the tropical versions of discrete RBMs, and use this to show that
the model $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has the expected dimension for many choices of $\mathcal{X}$ and $\mathcal{Y}$. On the other hand,
as a Hadamard product of naive Bayes models, $\mathrm{RBM}_{\mathcal{X},\mathcal{Y}}$ has dimension less than expected whenever
for some $j \in [m]$ the $|\mathcal{Y}_j|$-mixture of $\mathcal{E}_\mathcal{X}$ has dimension less than expected (Proposition 19).

Various questions remain unsettled: What is the exact dimension of the naive Bayes models with
general discrete variables? What is exactly the smallest number of hidden variables that make an
RBM a universal approximator? Does a binary RBM always have the expected dimension? The
geometric-combinatorial picture presented in this paper may be helpful in solving these problems.